Exploratory Data Analysis (EDA)

Automatic Car Classification with AI

Team 1:
Enzo Gianotti
Nicolas Passadore
Carlos Alberto Gómez Prado
Nahuel Garcia
Agustin Genoud
Jhoeel Luna

Table of contents

Objective: visualize the data and make well-founded decisions about how to clean the dataset.

1. Preparation

1.1 Import libraries

1.2 Data Processed

1.3 Quick view of images

1.4 Data extraction

2. EDA

3. pandas-profiling

4. Resources

5. Conclusions


1. Preparation

1.1 Import libraries

1.2 Data Processed

1.2.1 Download & Preprocessing
The notebooks used are available in the project's notebook folder, together with a link to dataset_cleaning on GitLab.

1.3 Quick view of images

Display of 5 sample images for each image set.

Total samples for P(2): 72569.

Total samples for P(3): 72569.

1.4 Data extraction

Extract the list of classes from the image directory and store it in a CSV file.
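The extraction step could look roughly like the sketch below. It assumes each class is a sub-folder of the image directory; the function names, the CSV column names, and the set of image extensions are all hypothetical and not taken from the project's notebooks.

```python
import csv
from pathlib import Path

# Assumed image extensions; adjust to the dataset's actual formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_classes(image_root):
    """Return (class_name, image_count) pairs, one per sub-folder of image_root."""
    rows = []
    for class_dir in sorted(Path(image_root).iterdir()):
        if not class_dir.is_dir():
            continue
        # Count image files anywhere below the class folder.
        count = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        rows.append((class_dir.name, count))
    return rows


def write_classes_csv(rows, csv_path):
    """Write the class list to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["class_name", "image_count"])
        writer.writerows(rows)
```

Recording the image count next to each class name makes empty class folders visible directly in the CSV.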

2. EDA

Statistics

Stacking per brand

Stacking per model

Top 19 most frequent years, with their absolute and relative frequencies.
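As an illustration of how absolute and relative frequencies relate, here is a minimal sketch using only the standard library. The notebook itself most likely used pandas for this. The function name and the toy data are hypothetical.

```python
from collections import Counter


def top_frequencies(values, n=19):
    """Return (value, absolute, relative) for the n most common values."""
    counts = Counter(values)
    total = sum(counts.values())
    # Relative frequency = absolute frequency / total number of samples.
    return [
        (value, count, count / total)
        for value, count in counts.most_common(n)
    ]


# Toy data for illustration only; not taken from the real dataset.
years = [2015, 2015, 2016, 2015, 2017, 2016]
for year, absolute, relative in top_frequencies(years, n=2):
    print(f"{year}: {absolute} ({relative:.1%})")
```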

List extracted from all folders in the S3 bucket, in CSV format.
This dataset contains information extracted from the bucket before it was downloaded, and it was compared against a list of the downloaded files. The comparison revealed empty folders with no images, files of size 0, and corrupted files, which were detected at download time.
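A comparison of this kind could be sketched as follows, assuming the bucket listing is available as relative paths with "/" separators and the download lives under a local folder. The function name and the result keys are hypothetical. The sketch does not detect corrupted files, which would require actually decoding each image (for example with PIL).

```python
import os


def compare_listing(bucket_keys, local_root):
    """Compare bucket keys against the files found under local_root.

    bucket_keys: iterable of relative paths using "/" separators
    (assumed format of the CSV listing).
    Returns missing files, zero-size files, and empty folders.
    """
    local_files = set()
    zero_size = []
    empty_folders = []
    for dirpath, dirnames, filenames in os.walk(local_root):
        if not dirnames and not filenames:
            rel_dir = os.path.relpath(dirpath, local_root)
            empty_folders.append(rel_dir.replace(os.sep, "/"))
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, local_root).replace(os.sep, "/")
            local_files.add(rel)
            if os.path.getsize(full) == 0:
                zero_size.append(rel)
    return {
        "missing": sorted(set(bucket_keys) - local_files),
        "zero_size": sorted(zero_size),
        "empty_folders": sorted(empty_folders),
    }
```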

In searching for the best model to clean the data during preprocessing, we built our own sequential model with Keras and tested other architectures and techniques to achieve better performance, including YOLOv5, YOLOv7, MobileNet, MobileNetV2, MobileNetV3, and DenseNet.

3. pandas-profiling

Generate profile reports from a pandas DataFrame.

4. Resources

Methodology(IBM)

Libs

AWS_Reference
matplotlib
seaborn
pandas
pandas-profiling
PIL
glob
sys
os
datetime
pathlib
cv2
numpy
random
tensorflow
warnings
Models_Keras

Papers

YOLOv5
YOLOv7
MobileNets
MobileNets_v2
MobileNets_v3

5. Conclusions


5.1 Research

5.2 Raw Data

5.3 Preprocessing

5.4 Final conclusions

Carlos
The graphs show that preprocessing improves the data by filtering for quality images; this should give us a better model and improve our results.

Nico
The quality of the source data forces us to make some decisions. Reducing the number of classes is one of them, along with balancing the classes by setting a fixed number of images for each one.

Enzo
To better balance the data we could use clustering, and we could improve the dataset by collecting more images.

Jhoel
Corrupted files and entries with missing images should be discarded to avoid bias.

Agus
Some images cannot be opened; we could delete them with a script, which would save processing time and reduce the size of the dataset.

Nahu
To balance the data we would need to define a criterion and filter out everything that is not a car; this will require several models.
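One possible reading of the balancing idea raised in these conclusions (fewer classes, with a fixed number of images per class) is sketched below. The function name, the per-class threshold, and the choice to drop classes that are too small are assumptions made for illustration, not decisions taken by the team.

```python
import random


def balance_classes(images_by_class, per_class, seed=0):
    """Keep classes with at least per_class images and sample per_class from each.

    images_by_class: dict mapping class name to a list of image paths.
    Classes with fewer than per_class images are dropped, which also
    reduces the number of classes.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for name, paths in images_by_class.items():
        if len(paths) < per_class:
            continue  # class is too small: dropped
        balanced[name] = sorted(rng.sample(list(paths), per_class))
    return balanced
```

Dropping small classes is only one option; collecting more images for them, as suggested above, would keep the classes at the cost of extra data gathering.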